What questions should you ask your model to determine its fit? We will discuss how to quantify how well a linear regression model fits the data, how to diagnose problems with the model using visualizations, and how each observation impacts the model.
These are notes from my data science learning journey through DataCamp.
Code
# Import numpy with alias np
import numpy as np

# Import pandas with alias pd
import pandas as pd

# Import seaborn with alias sns
import seaborn as sns

# Import matplotlib.pyplot with alias plt
import matplotlib.pyplot as plt

# Import the ols function
from statsmodels.formula.api import ols
Quantifying model fit
Analyze and assess the accuracy of model predictions.
Coefficient of determination: R-squared (1 is a perfect fit, 0 means the model is no better than randomness).
The proportion of variance in the response variable that is predictable (explainable) from the explanatory variable. It indicates whether the model at hand is effective at summarizing our data. The data, the context, and the way we transform variables heavily affect how R-squared should be interpreted.
Accessible via .summary() or the .rsquared attribute
Residual standard error (RSE)
A residual is the difference between a predicted and an observed response value (the vertical distance between the point and the fitted line). The RSE has the same units as the response variable.
MSE = RSE ** 2
RSE = np.sqrt(MSE)
Accessible via the .mse_resid attribute (its square root is the RSE)
RSE can also be calculated manually: square each residual and sum the squares, compute the degrees of freedom (number of observations minus number of model coefficients), then take the square root of the sum of squares divided by the degrees of freedom.
Root mean square error
Unlike the RSE, the RMSE does not account for degrees of freedom: we divide the sum of squared residuals by the number of observations rather than by the degrees of freedom.
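A minimal sketch of both manual calculations, assuming a fitted statsmodels results object named mdl (a placeholder name, e.g. mdl = ols(...).fit()):
Code
# Assume `mdl` is a fitted statsmodels OLS results object (placeholder name)
residuals = mdl.resid                          # observed minus predicted responses
resid_sum_of_sq = np.sum(residuals ** 2)       # sum of squared residuals

# Degrees of freedom: number of observations minus number of coefficients
deg_freedom = mdl.nobs - len(mdl.params)

rse = np.sqrt(resid_sum_of_sq / deg_freedom)   # RSE divides by degrees of freedom
rmse = np.sqrt(resid_sum_of_sq / mdl.nobs)     # RMSE divides by the observation count

print("RSE: ", rse)    # should match np.sqrt(mdl.mse_resid)
print("RMSE: ", rmse)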
Coefficient of determination
The coefficient of determination measures how well the linear regression line fits the observed values. In simple linear regression, it is equal to the square of the correlation between the explanatory and response variables.
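A quick sketch showing this equality on made-up data (the DataFrame and column names here are hypothetical):
Code
# Hypothetical example data; `x` and `y` are placeholder column names
df = pd.DataFrame({"x": [1, 2, 3, 4, 5],
                   "y": [2.1, 3.9, 6.2, 8.1, 9.8]})

# Fit a simple linear regression
mdl_example = ols("y ~ x", data=df).fit()

# R-squared should equal the squared correlation between x and y
correlation = df["x"].corr(df["y"])
print(mdl_example.rsquared)
print(correlation ** 2)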
Code
# Fetch the data for which the model is created
ad_conversion = pd.read_csv('dataset/ad_conversion.csv')
ad_conversion.head()
   spent_usd  n_impressions  n_clicks
0       1.43           7350         1
1       1.82          17861         2
2       1.25           4259         1
3       1.29           4133         1
4       4.77          15615         3
Code
# Click vs. impression model
mdl_click_vs_impression_orig = ols('n_clicks ~ n_impressions', data=ad_conversion).fit()

# Print a summary of mdl_click_vs_impression_orig
print(mdl_click_vs_impression_orig.summary())

# Create qdrt_n_impressions and qdrt_n_clicks (quartic-root transformations)
ad_conversion["qdrt_n_impressions"] = ad_conversion['n_impressions'] ** 0.25
ad_conversion["qdrt_n_clicks"] = ad_conversion['n_clicks'] ** 0.25

# Transformed click vs. impression model
mdl_click_vs_impression_trans = ols('qdrt_n_clicks ~ qdrt_n_impressions', data=ad_conversion).fit()

# Print a summary of mdl_click_vs_impression_trans
print(mdl_click_vs_impression_trans.summary())
OLS Regression Results
==============================================================================
Dep. Variable: n_clicks R-squared: 0.892
Model: OLS Adj. R-squared: 0.891
Method: Least Squares F-statistic: 7683.
Date: Fri, 13 Jan 2023 Prob (F-statistic): 0.00
Time: 09:10:19 Log-Likelihood: -4126.7
No. Observations: 936 AIC: 8257.
Df Residuals: 934 BIC: 8267.
Df Model: 1
Covariance Type: nonrobust
=================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 1.6829 0.789 2.133 0.033 0.135 3.231
n_impressions 0.0002 1.96e-06 87.654 0.000 0.000 0.000
==============================================================================
Omnibus: 247.038 Durbin-Watson: 0.870
Prob(Omnibus): 0.000 Jarque-Bera (JB): 13215.277
Skew: -0.258 Prob(JB): 0.00
Kurtosis: 21.401 Cond. No. 4.88e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.88e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
OLS Regression Results
==============================================================================
Dep. Variable: qdrt_n_clicks R-squared: 0.945
Model: OLS Adj. R-squared: 0.944
Method: Least Squares F-statistic: 1.590e+04
Date: Fri, 13 Jan 2023 Prob (F-statistic): 0.00
Time: 09:10:19 Log-Likelihood: 193.90
No. Observations: 936 AIC: -383.8
Df Residuals: 934 BIC: -374.1
Df Model: 1
Covariance Type: nonrobust
======================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------
Intercept 0.0717 0.017 4.171 0.000 0.038 0.106
qdrt_n_impressions 0.1115 0.001 126.108 0.000 0.110 0.113
==============================================================================
Omnibus: 11.447 Durbin-Watson: 0.568
Prob(Omnibus): 0.003 Jarque-Bera (JB): 10.637
Skew: -0.216 Prob(JB): 0.00490
Kurtosis: 2.707 Cond. No. 52.1
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Code
# Print the coefficient of determination for mdl_click_vs_impression_orig
print(mdl_click_vs_impression_orig.rsquared)

# Print the coefficient of determination for mdl_click_vs_impression_trans
print(mdl_click_vs_impression_trans.rsquared)

print("\n The transformed model has a higher coefficient of determination than the original model, suggesting that it gives a better fit to the data.")
0.8916134973508041
0.9445272817143905
The transformed model has a higher coefficient of determination than the original model, suggesting that it gives a better fit to the data.
Residual standard error
The residual standard error (RSE) measures the typical size of the residuals, i.e. how wrong the predictions typically are. Smaller values indicate a better fit, with zero being a perfect fit.
Code
# Calculate mse_orig for mdl_click_vs_impression_orig
mse_orig = mdl_click_vs_impression_orig.mse_resid

# Calculate rse_orig for mdl_click_vs_impression_orig and print it
rse_orig = np.sqrt(mse_orig)
print("RSE of original model: ", rse_orig)

# Calculate mse_trans for mdl_click_vs_impression_trans
mse_trans = mdl_click_vs_impression_trans.mse_resid

# Calculate rse_trans for mdl_click_vs_impression_trans and print it
rse_trans = np.sqrt(mse_trans)
print("RSE of transformed model: ", rse_trans)
RSE of original model: 19.905838862478134
RSE of transformed model: 0.19690640896875722
Visual model fit
If the model fits well, the residuals should be approximately normally distributed around the fitted line/curve with a mean of zero. The residuals also show where the model under- or over-predicts: points above the fitted line have positive residuals, and points below it have negative residuals.
Residuals vs. fitted values plot
This plot visualizes trends in the residuals. For a good fit, the residuals should track the y = 0 line closely; a trend line that wanders all over the place indicates a problem with the model.
sns.residplot()
Q-Q Plot
If the points track closely along the straight line, the residuals are approximately normally distributed, satisfying one of the conditions of linear regression. Otherwise, the normality condition does not hold.
qqplot() (from statsmodels.api import qqplot)
Scale-location plot: square root of standardized residuals vs. fitted values
This plot shows whether the size of the residuals gets bigger or smaller as the fitted values change. If the line bounces around irregularly, the residual size varies inconsistently as the fitted values change.
Drawing diagnostic plots
Let’s draw diagnostic plots using the Taiwan real estate dataset and the model of house prices versus number of convenience stores.
# Fit the model of house price vs. number of convenience stores
# (assumes the taiwan_real_estate DataFrame has been loaded)
mdl_price_vs_conv = ols("price_twd_msq ~ n_convenience", data=taiwan_real_estate).fit()

# Plot the residuals vs. fitted values
sns.residplot(x='n_convenience', y='price_twd_msq', data=taiwan_real_estate, lowess=True)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")

# Show the plot
plt.show()
Code
# Import qqplot
from statsmodels.api import qqplot

# Create the Q-Q plot of the residuals
qqplot(data=mdl_price_vs_conv.resid, fit=True, line="45")

# Show the plot
plt.show()
Code
# Preprocessing steps: standardized residuals and their square-rooted absolute values
model_norm_residuals = mdl_price_vs_conv.get_influence().resid_studentized_internal
model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals))

# Create the scale-location plot
sns.regplot(x=mdl_price_vs_conv.fittedvalues, y=model_norm_residuals_abs_sqrt, ci=None, lowess=True)
plt.xlabel("Fitted values")
plt.ylabel("Sqrt of abs val of stdized residuals")

# Show the plot
plt.show()
Code
print("Above three diagnostic plots are excellent for sanity-checking the quality of your models")
Above three diagnostic plots are excellent for sanity-checking the quality of your models
Outliers, leverage & influence
Some special values can significantly affect our models: removing just one or two extreme data points can completely alter a slope.
Unusual data points: outliers and extreme values
There are two ways a data point can be unusual: it can lie far away from the model fit (vertically), or it can lie far away from the other data points along the explanatory axis (horizontally). A sketch of highlighting the horizontal kind follows below.
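This is a minimal sketch, assuming the taiwan_real_estate DataFrame used below, with dist_to_mrt_m and price_twd_msq columns; the 99th-percentile cutoff is an arbitrary illustrative choice:
Code
# Flag the 1% of observations farthest from an MRT station (hypothetical cutoff)
cutoff = taiwan_real_estate["dist_to_mrt_m"].quantile(0.99)
taiwan_real_estate["extreme_dist"] = taiwan_real_estate["dist_to_mrt_m"] > cutoff

# Color the extreme points to see where they sit relative to the rest of the data
sns.scatterplot(x="dist_to_mrt_m", y="price_twd_msq", hue="extreme_dist",
                data=taiwan_real_estate)
plt.show()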
Leverage & influence
The concepts of leverage and influence are important when determining whether some unusual data points are negatively affecting your model.
Leverage measures how extreme the values of the explanatory variables are. Since most of the observations are a short distance from the nearest MRT station, observations with long distances are more extreme. High-leverage points are those whose explanatory values lie farthest from the bulk of the data.
Influence measures how much the model would change if the observation were excluded from the dataset. Observations far from the trend line are expected to have high influence, due to their large residuals and their distance from other observations. Here, the majority of the influential houses had prices higher than the model predicted (and one had a lower price).
Extracting leverage and influence
Let's find the leverage and influence of each observation in the Taiwan real estate data.
Code
# Create sqrt_dist_to_mrt_m
taiwan_real_estate["sqrt_dist_to_mrt_m"] = np.sqrt(taiwan_real_estate["dist_to_mrt_m"])

# Run a linear regression of price_twd_msq vs. sqrt_dist_to_mrt_m
mdl_price_vs_dist = ols("price_twd_msq ~ sqrt_dist_to_mrt_m", data=taiwan_real_estate).fit()

# Get the influence summary frame (one row of diagnostics per observation)
summary_info = mdl_price_vs_dist.get_influence().summary_frame()

# Add the hat_diag column to taiwan_real_estate, name it leverage
taiwan_real_estate["leverage"] = summary_info['hat_diag']

# Sort taiwan_real_estate by leverage in descending order and print the head
print(taiwan_real_estate.sort_values(by='leverage', ascending=False).head())

# Add the cooks_d column to taiwan_real_estate, name it cooks_dist
taiwan_real_estate['cooks_dist'] = summary_info['cooks_d']

# Sort taiwan_real_estate by cooks_dist in descending order and print the head
print(taiwan_real_estate.sort_values("cooks_dist", ascending=False).head())
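To see how much a few influential points matter, here is a hedged sketch that drops the rows with the highest Cook's distance and refits the model; dropping five rows is an arbitrary choice for illustration:
Code
# Drop the 5 most influential observations (arbitrary count, for illustration)
top_influential = taiwan_real_estate.sort_values("cooks_dist", ascending=False).head(5)
reduced = taiwan_real_estate.drop(top_influential.index)

# Refit the same model on the reduced data
mdl_reduced = ols("price_twd_msq ~ sqrt_dist_to_mrt_m", data=reduced).fit()

# Compare the slopes: a noticeable shift confirms those points were influential
print("Original slope: ", mdl_price_vs_dist.params["sqrt_dist_to_mrt_m"])
print("Without top 5:  ", mdl_reduced.params["sqrt_dist_to_mrt_m"])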